11 research outputs found

    Dual-Channel Speech Enhancement Based on Extended Kalman Filter Relative Transfer Function Estimation

    This paper deals with speech enhancement in dual-microphone smartphones using beamforming along with postfiltering techniques. The performance of these algorithms relies on a good estimation of the acoustic channel and of the speech and noise statistics. In this work we present a speech enhancement system that combines the estimation of the relative transfer function (RTF) between microphones, using an extended Kalman filter framework, with a novel speech presence probability estimator intended to track the variability of the noise statistics. The available dual-channel information is exploited to obtain more reliable estimates of the clean speech statistics. Noise reduction is further improved by means of postfiltering techniques that take advantage of the speech presence estimation. Our proposal is evaluated in different reverberant and noisy environments with the smartphone used in both close-talk and far-talk positions. The experimental results show that our system achieves improvements in noise reduction, speech distortion and speech intelligibility compared to other state-of-the-art approaches.
    Funding: Spanish MINECO/FEDER Project TEC2016-80141-P; Spanish Ministry of Education, National Program FPU, Grant FPU15/0416
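
    The per-frequency RTF tracking idea can be sketched as a scalar Kalman recursion: modeling the secondary microphone as y2[t] = h[t]·y1[t] + v[t], with the RTF h following a random walk, each frame corrects the channel estimate from the innovation. This is a minimal linear sketch of the idea, not the paper's full extended-Kalman formulation; the function name and the noise variances `q` and `r` are illustrative.

```python
import numpy as np

def track_rtf(y1, y2, q=1e-3, r=1e-2):
    """Track a relative transfer function h[t] between two microphone
    channels, with y2[t] = h[t] * y1[t] + v[t] and h following a random
    walk with variance q. Scalar Kalman recursion for one frequency bin."""
    h = 0.0 + 0.0j   # channel (RTF) estimate
    p = 1.0          # estimate variance
    est = np.empty_like(y2)
    for t in range(len(y1)):
        p = p + q                              # predict: random-walk model
        s = np.abs(y1[t]) ** 2 * p + r         # innovation variance
        k = p * np.conj(y1[t]) / s             # Kalman gain
        h = h + k * (y2[t] - h * y1[t])        # correct with the innovation
        p = (1.0 - (k * y1[t]).real) * p       # update estimate variance
        est[t] = h
    return est
```

    In a full system this recursion would run independently in every STFT frequency bin, with the tracked RTF feeding the beamformer.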

    A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality

    This letter proposes a perceptual metric for speech quality evaluation which is suitable, as a loss function, for training deep learning methods. The metric, derived from the perceptual evaluation of speech quality (PESQ) algorithm, is computed on a per-frame basis from the power spectra of the reference and processed speech signals. Two disturbance terms, which account for distortion once auditory masking and threshold effects are factored in, amend the mean square error (MSE) loss function by introducing perceptual criteria based on human psychoacoustics. The proposed loss function is evaluated for noisy speech enhancement with deep neural networks. Experimental results show that our metric achieves significant gains in speech quality (evaluated using an objective metric and a listening test) when compared to the MSE and other perceptual loss functions from the literature.
    Funding: Spanish MINECO/FEDER, Grant TEC2016-80141-P; Spanish Ministry of Education, National Program FPU, Grant FPU15/04161; NVIDIA Corporation with the donation of a Titan X GP
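
    The general shape of such a loss, an MSE term on log power spectra amended by a perceptually asymmetric disturbance term (added noise-like energy hurts more than attenuation), can be sketched as follows. This is a toy illustration only, not the letter's actual PESQ-derived disturbances; the weighting `alpha` and the asymmetry rule are stand-ins.

```python
import numpy as np

def perceptual_mse(ref_ps, proc_ps, alpha=0.5, eps=1e-8):
    """MSE on log power spectra plus an asymmetric 'disturbance' term
    that penalises added (noise-like) energy above the reference more
    strongly than attenuation below it."""
    log_ref = np.log(ref_ps + eps)
    log_proc = np.log(proc_ps + eps)
    mse = np.mean((log_proc - log_ref) ** 2)
    added = np.maximum(log_proc - log_ref, 0.0)   # only excess energy counts
    return mse + alpha * np.mean(added ** 2)
```

    In a deep learning framework the same expression would be written with the framework's tensor ops so gradients flow to the network.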

    Silent Speech Interfaces for Speech Restoration: A Review

    This review summarises the status of silent speech interface (SSI) research. SSIs rely on non-acoustic biosignals generated by the human body during speech production to enable communication whenever normal verbal communication is not possible or not desirable. In this review we focus on the first case and present the latest SSI research aimed at providing new alternative and augmentative communication methods for persons with severe speech disorders. SSIs can employ a variety of biosignals to enable silent communication, such as electrophysiological recordings of neural activity, electromyographic (EMG) recordings of vocal tract movements or the direct tracking of articulator movements using imaging techniques. Depending on the disorder, some sensing techniques may be better suited than others to capture speech-related information. For instance, EMG and imaging techniques are well suited for laryngectomised patients, whose vocal tract remains almost intact although they are unable to speak after the removal of the vocal folds, but fail for severely paralysed individuals. From the biosignals, SSIs decode the intended message using automatic speech recognition or speech synthesis algorithms. Despite considerable advances in recent years, most present-day SSIs have only been validated in laboratory settings with healthy users. Thus, as discussed in this paper, a number of challenges remain to be addressed in future research before SSIs can be promoted to real-world applications. If these issues can be addressed successfully, future SSIs will improve the lives of persons with severe speech impairments by restoring their communication capabilities.
    Funding: Agencia Estatal de Investigacion (AEI), Grant PID2019-108040RB-C22/AEI/10.13039/501100011033; Spanish Ministry of Science, Innovation and Universities, Juan de la Cierva-Incorporation Fellowship IJCI-2017-32926 (Jose A. Gonzalez-Lopez)

    Online multichannel speech enhancement combining statistical signal processing and deep neural networks

    Speech-related applications on mobile devices require high-performance speech enhancement algorithms to tackle challenging real-world noisy environments. These speech processing techniques have to ensure good noise reduction with low speech distortion, thus improving the perceptual quality and intelligibility of the enhanced speech signal. In addition, current mobile devices often embed several microphones, allowing them to exploit spatial information during the enhancement procedure. On the other hand, low latency and efficiency are requirements for the extensive use of these technologies. Among the different speech processing paradigms, statistical signal processing offers limited performance under non-stationary noisy environments, while deep neural networks can lack generalization under real conditions. The main goal of this Thesis is the development of online multichannel speech enhancement algorithms for speech services on mobile devices. The proposed techniques use multichannel signal processing to increase noise reduction performance without degrading the quality of the speech signal. Moreover, deep neural networks are applied in specific parts of the algorithm where modeling by classical methods would otherwise be difficult or very limited. This allows for the use of more capable deep learning methods in real-time online processing algorithms.

    Our contributions focus on different noisy environments where these mobile speech technologies can be applied. First, we develop a speech enhancement algorithm suitable for dual-microphone smartphones used in noisy and reverberant environments. The noisy speech signal is processed using a beamforming-plus-postfiltering strategy that exploits the dual-channel properties of the clean speech and noise signals to obtain more accurate acoustic parameters. Thus, the temporal variability of the relative transfer functions between acoustic channels is tracked using an extended Kalman filter framework. Noise statistics are obtained by means of a recursive procedure using the speech presence probability. This speech presence is estimated through either statistical spatial models or deep neural network mask estimators, both exploiting dual-channel features from the noisy speech signal.

    Then, we propose a recursive expectation-maximization framework for online multichannel speech enhancement. The goal is the joint estimation of the clean speech statistics and the acoustic model parameters in order to increase robustness under non-stationary conditions. The noisy speech signal is first processed using a beamformer followed by a Kalman postfilter, which exploits the temporal correlations of the speech magnitude. The speech presence probability is then obtained using a deep neural network mask estimator, and its estimates are further refined through statistical spatial models defined for the noisy speech and noise signals. The resulting clean speech and speech presence estimates are then employed for maximum-likelihood estimation of the beamformer and postfilter parameters. This also allows for an iterative procedure with positive feedback between the estimation of speech statistics and acoustic parameters.

    Scenarios with multiple overlapped speakers are also analyzed in this Thesis, where we explore beamforming with model parameters obtained from deep neural network mask estimators. To deal with interfering speakers, we study the use of adapted mask estimators that exploit spectral and spatial information, obtained through auxiliary information, to focus on a target speaker. Therefore, additional speech processing blocks are integrated into the mask estimators so that the network can discriminate among different speakers. As an application, we consider the problem of automatic speech recognition in meeting scenarios, where our proposal can be used as a front-end processing stage.

    Finally, we study the training of deep learning methods for speech processing using perceptual considerations. We propose a loss function based on a perceptual-quality objective metric and evaluate it for training deep neural network-based single-channel speech enhancement algorithms, in order to improve the speech quality perceived by human listeners. The two most common approaches for single-channel processing using these networks are considered: spectral mapping and spectral masking. We also explore the combination of different objective metric-related loss functions in a multi-objective training approach.

    To conclude, we would like to highlight that our contributions successfully integrate signal processing and deep learning methods to jointly exploit spectral, spatial, and temporal speech features. As a result, the set of proposed techniques provides a versatile framework for robust speech processing under very challenging acoustic environments, allowing us to improve perceptual quality, intelligibility, and distortion measures.
    Doctoral thesis, Univ. Granada
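
    The beamforming-with-DNN-masks building block that recurs in these contributions can be illustrated, for one frequency bin, by mask-weighted spatial covariances feeding an MVDR beamformer. This is a simplified sketch under one common choice (principal-eigenvector steering vector), not the thesis' exact estimators:

```python
import numpy as np

def mvdr_from_masks(Y, speech_mask, eps=1e-6):
    """Mask-based MVDR beamforming for one frequency bin.
    Y: (channels, frames) complex STFT; speech_mask: (frames,) in [0, 1]."""
    w_s = speech_mask / (speech_mask.sum() + eps)
    w_n = (1 - speech_mask) / ((1 - speech_mask).sum() + eps)
    R_s = (Y * w_s) @ Y.conj().T                    # speech covariance
    R_n = (Y * w_n) @ Y.conj().T + eps * np.eye(Y.shape[0])  # noise covariance
    d = np.linalg.eigh(R_s)[1][:, -1]               # steering vector estimate
    num = np.linalg.solve(R_n, d)
    w = num / (d.conj() @ num)                      # MVDR weights
    return w.conj() @ Y                             # enhanced signal
```

    The distortionless constraint w^H d = 1 means the target component passes unchanged (up to a global phase), while noise energy is minimised.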

    The role of window length and shift in complex-domain DNN-based speech enhancement

    Deep learning techniques have been widely applied to speech enhancement, as they show the outstanding modeling capabilities needed for proper speech-noise separation. In contrast to other end-to-end approaches, masking-based methods take speech spectra as input to the deep neural network and provide spectral masks for noise removal or attenuation. In these approaches the Short-Time Fourier Transform (STFT) and, particularly, the parameters used for the analysis/synthesis window play an important role which is often neglected. In this paper, we analyze the effects of window length and shift on a complex-domain convolutional-recurrent neural network (DCCRN) which is able to provide, separately, magnitude and phase corrections. Different perceptual quality and intelligibility objective metrics are used to assess its performance. As a result, we have observed that phase corrections have an increased impact with shorter window sizes. Similarly, as window overlap increases, phase takes more relevance than the magnitude spectrum in speech enhancement.
    Funding: Project PID2019-104206GB-I00 funded by MCIN/AEI/10.13039/501100011033
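
    The analysis/synthesis parameters under study can be made concrete with a weighted overlap-add round trip, where `win_len` and `hop` are the window length and shift: changing `hop` changes the overlap (75% for hop = win_len/4) and hence how analysis, masking and synthesis interact. A minimal sketch with a sqrt-Hann window and explicit synthesis normalisation:

```python
import numpy as np

def stft_istft_roundtrip(x, win_len=512, hop=128):
    """STFT analysis followed by weighted overlap-add synthesis.
    Normalising by the accumulated squared window makes the round trip
    exact wherever the frames cover the signal, for any window/shift pair."""
    win = np.sqrt(np.hanning(win_len))
    frames = [x[i:i + win_len] * win for i in range(0, len(x) - win_len + 1, hop)]
    spec = np.fft.rfft(frames, axis=-1)              # analysis
    out = np.zeros(len(x))
    norm = np.zeros(len(x))
    for k, frame in enumerate(np.fft.irfft(spec, n=win_len, axis=-1)):
        i = k * hop
        out[i:i + win_len] += frame * win            # weighted overlap-add
        norm[i:i + win_len] += win ** 2
    valid = norm > 1e-8
    out[valid] /= norm[valid]                        # synthesis normalisation
    return out
```

    A mask-based enhancer would modify `spec` between analysis and synthesis; the window length fixes the spectral resolution seen by the network, while the shift fixes the overlap available to smooth phase errors at synthesis.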

    Online Multichannel Speech Enhancement Based on Recursive EM and DNN-Based Speech Presence Estimation

    This article presents a recursive expectation-maximization algorithm for online multichannel speech enhancement. A deep neural network mask estimator is used to compute the speech presence probability, which is then improved by means of statistical spatial models of the noisy speech and noise signals. The clean speech signal is estimated using beamforming, single-channel linear postfiltering and speech presence masking. The clean speech statistics and speech presence probabilities are finally used to compute the acoustic parameters for beamforming and postfiltering by means of maximum likelihood estimation. This iterative procedure is carried out on a frame-by-frame basis. The algorithm integrates the different estimates in a common statistical framework suitable for online scenarios. Moreover, our method can successfully exploit spectral, spatial and temporal speech properties. Our proposed algorithm is tested in different noisy environments using the multichannel recordings of the CHiME-4 database. The experimental results show that our method outperforms other related state-of-the-art approaches in noise reduction performance, while allowing low-latency processing for real-time applications.
    Funding: Spanish MICINN/FEDER, Grant PID2019-104206GB-I00; Spanish Ministry of Universities, National Program FPU, Grant FPU15/04161
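
    The frame-by-frame character of the recursion can be sketched by the online update of speech and noise spatial covariances from a per-frame speech presence probability; beamformer and postfilter parameters can then be re-derived from these statistics at every frame. A simplified sketch only: the forgetting factor `alpha` is illustrative, and the article's full recursion also refines the DNN masks with spatial models.

```python
import numpy as np

def recursive_covariances(Y, masks, alpha=0.95):
    """Online update of speech/noise spatial covariance matrices.
    Y: (channels, frames) complex STFT for one frequency bin;
    masks: (frames,) speech presence probabilities in [0, 1]."""
    C = Y.shape[0]
    R_s = np.zeros((C, C), complex)
    R_n = 1e-3 * np.eye(C, dtype=complex)   # small prior keeps R_n invertible
    for t in range(Y.shape[1]):
        y = Y[:, t:t + 1]
        p = masks[t]
        # each frame contributes to speech or noise statistics according to p
        R_s = alpha * R_s + (1 - alpha) * p * (y @ y.conj().T)
        R_n = alpha * R_n + (1 - alpha) * (1 - p) * (y @ y.conj().T)
    return R_s, R_n
```

    The forgetting factor trades tracking speed against estimation variance, which is what makes the scheme suitable for non-stationary, low-latency scenarios.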

    Dual-channel eKF-RTF framework for speech enhancement with DNN-based speech presence estimation

    This paper presents a dual-channel speech enhancement framework that effectively integrates deep neural network (DNN) mask estimators. Our framework follows a beamforming-plus-postfiltering approach intended for noise reduction on dual-microphone smartphones. An extended Kalman filter is used for the estimation of the relative acoustic channel between microphones, while the noise estimation is performed using a speech presence probability estimator. We propose the use of a DNN estimator to improve the prediction of the speech presence probabilities without making any assumption about the statistics of the signals. We evaluate and compare different dual-channel features to improve the accuracy of this estimator, including the power and phase difference between the speech signals at the two microphones. The proposed integrated scheme is evaluated in different reverberant and noisy environments when the smartphone is used in both close- and far-talk positions. The experimental results show that our approach achieves significant improvements in terms of speech quality, intelligibility, and distortion when compared to other approaches based only on statistical signal processing.
    Funding: Spanish Ministry of Science and Innovation, Project PID2019-104206GB-I00/AEI/10.13039/501100011033; Spanish Ministry of Universities, National Program FPU, Grant FPU15/04161
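
    The two dual-channel features named above, the power difference and the phase difference between microphones, can be computed per time-frequency bin as sketched below; the dB scaling and function naming are illustrative rather than the paper's exact feature definitions.

```python
import numpy as np

def dual_channel_features(Y1, Y2, eps=1e-10):
    """Per-bin dual-channel features from two complex STFTs:
    power level difference (dB) and inter-microphone phase difference."""
    p1, p2 = np.abs(Y1) ** 2, np.abs(Y2) ** 2
    pld = 10.0 * np.log10((p1 + eps) / (p2 + eps))   # power level difference, dB
    ipd = np.angle(Y1 * np.conj(Y2))                 # phase difference, radians
    return pld, ipd
```

    In close-talk positions the speech-dominated bins show a strong positive power level difference toward the primary microphone, which is what makes these features informative for a speech presence estimator.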